Abstract

In this paper we present the application of a boosting classification algorithm to 
confidence scoring. We derive feature vectors from speech recognition lattices and 
feed them into a boosting classifier. This classifier combines hundreds of very simple 
'weak learners' and derives classification rules that can reduce the confidence error 
rate by up to 34%. We compare our results to those obtained using two other standard 
classification techniques, Support Vector Machines (SVMs) and Classification and
Regression Trees (CART), and show significant improvements. Furthermore, the nature
of the boosting algorithm allows us to combine the best single classifier and improve 
its performance. 

We present experimental results on real-world corpora derived from our SpeechBot
Web index (http://www.speechbot.com) and from the HUB4 DARPA evaluation sets. We
believe these results have wide applicability to audio indexing and to acoustic and 
language modeling adaptation where word confidence scores can be used in iterative 
adaptation schemes. 
1 Introduction

Speech recognition technology has advanced to the stage where real-world applications 
are feasible. However, because speech recognition is still imperfect, confidence
scoring has emerged as an important component of current systems. Confidence
scoring attempts to assign 'trust' to the hypotheses produced by speech recognition 
systems. 

We are interested in audio indexing systems for the Web. Confidence scores can 
be very useful for such systems where an enormous amount of data is indexed and the 
ground truth is not known. For example, our speech indexing system SpeechBot [10] 
indexes close to 9000 hours of untranscribed audio content. A good confidence scorer 
could enable us to make use of such data, either for acoustic and language model
adaptation or even for retraining [9]. We could also use confidence scores to improve our
indexing function. 

The literature contains many examples of techniques for word confidence scoring. 
Typical approaches form a feature vector by concatenating or otherwise combining 
one or more basic features correlated with word confidence, including basic features 
of adjacent words. One of a variety of classifiers is then applied to this vector to 
determine confidence for the word. Features based on the acoustic model (e.g. see
[12]), the language model (e.g. [16]), the decoding process (e.g. [17, 8, 19, 5, 7]) and
word semantics ([4, 11]) have been proposed. Classifiers investigated include simple
thresholding [19], linear discriminant analysis followed by a linear threshold [12, 11],
Bayes classifiers [5], neural networks [17, 12, 8, 18], generalized linear models [7, 14]
and decision trees [8, 11].

In this paper we explore the use of boosting techniques to classify confidence feature
vectors. Boosting combines hundreds or even thousands of very simple classifiers
(called 'weak learners' in the Machine Learning literature) by a weighted sum. Each 
classifier focuses its attention on those vectors on which the previous classifier fails. 

The use of boosting classifiers with the choice of weak learners proposed in [15] 
offers us the unique advantage of being less sensitive to spurious features. That is, 
components of the confidence feature vector that do not add any advantage are ignored
in favor of more promising features. Additionally, we are able to analyze the
relative importance of each feature in a principled way. A simple inspection of the 
weak learners highlights those features that contribute most to classification. 

2 Confidence Features 

We use a fairly standard set of confidence features augmented with one novel feature 
to form a feature vector for each hypothesized word. Since our boosting classifier will 
ignore components that supply spurious information, there is no harm in including as 
many features as possible (other than wasted processing time). Our basic set of features
is listed in Table 1.

In addition to this basic set, we include context information for each word. We form
the final confidence feature vector for each hypothesized word as the concatenation of
the feature set in Table 1 for that word, and the corresponding sets for the most likely (in
the Viterbi search sense) preceding and following words. Our final confidence feature
vector thus has dimension 39.

Component   Basic Feature
---------   -------------
0           word graph probability (e.g. [19])
1           hypothesis density at word beginning
2           hypothesis density at word end
3           average hypothesis density over the word
4           hypothesis density at preceding frame
5           hypothesis density at following frame
6           acoustic score
7           unigram score
8           word length in frames
9           word length in phones
10-12       3D point representing the first phone of the word (explained in the text)

Table 1: Core feature set used to construct the feature vector for each hypothesized
word. This vector is augmented by left and right context as described in the text.

Our one novel feature is a 3D representation of the first phone of each word. Our
motivation is that we wish to include more information about the intrinsic confusability
of words in confidence scoring metrics. However, since there is no simple
low-dimensional monothetic representation of word confusability, we approximate it by the
confusability of the first phone of the word. This is reasonable since an error at the
beginning of the word will impact the whole word. Indeed, many words begin with
easily confusable consonants.

We represent the confusability of the first phone in the word by transforming the
phone label to a real three-dimensional point using multidimensional scaling (MDS).
This transformation from a label to the real space allows us to treat this feature
numerically, like all other features. MDS (e.g. [20]) is a standard technique which
transforms a series of objects, about which only relative distance information is
available, to a series of N-dimensional points. The mapping attempts to preserve the relative
distances between objects such that objects which are known to be 'close' to each other
are 'close' in the N-dimensional space. To transform phone labels using MDS, we use
a phone confusion matrix as a measure of the relative distance among them. Figure 1
shows our 3D representation of TIMIT phones derived using MDS on their confusion
matrix. We see that linguistic categories are well preserved in this Euclidean space. We
use this mapping to obtain a 3D point for the first phone of each word.
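
The mapping from phone labels to 3D points can be illustrated with a minimal sketch of classical (metric) MDS. The text does not state which MDS variant was used or how confusion counts were converted to distances, so the `confusion_to_distance` conversion, the function names, and the toy 4-phone confusion counts below are all assumptions made for illustration.

```python
import numpy as np

def classical_mds(dist, n_dims=3):
    """Embed objects in n_dims dimensions from a symmetric distance matrix."""
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    # Double-center the squared distances to obtain a Gram (inner product) matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    # Keep the n_dims largest eigenvalues; clip small negatives that arise
    # when the input distances are not exactly Euclidean.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_dims]
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale

def confusion_to_distance(confusion):
    """Assumed conversion: frequently confused phones get small distances."""
    counts = np.asarray(confusion, dtype=float)
    rates = counts / counts.sum(axis=1, keepdims=True)  # row-normalize counts
    sim = 0.5 * (rates + rates.T)                       # symmetrize
    dist = 1.0 - sim / sim.max()
    np.fill_diagonal(dist, 0.0)
    return dist

# Toy confusion counts for four phones (hypothetical numbers, not from the paper).
phones = ["p", "b", "s", "z"]
confusion = np.array([[80, 15, 3, 2],
                      [14, 82, 2, 2],
                      [2, 3, 78, 17],
                      [3, 2, 16, 79]])
points = classical_mds(confusion_to_distance(confusion), n_dims=3)
phone_to_point = dict(zip(phones, points))
```

In this toy example the often-confused pairs (p, b) and (s, z) land closer together than pairs from different groups, which is the property the feature relies on.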

3 Boosting Classifier 

Boosting is a novel approach to classification which has lately received much attention
due to its simplicity, elegance, power and ease of implementation. The basic ideas and
algorithms were introduced by Schapire [13] and Freund [6].

Figure 1: 3D Euclidean representation of TIMIT phones derived using MDS on their
confusion matrix. For clarity, only points close to the origin are shown.

Boosting applies a classification procedure iteratively to a set of weighted data vectors.
At first each vector is assigned an equal weight (or a weight depending on its prior
probability). On each iteration, a classifier is learnt and the vectors that are classified 
incorrectly have their weights increased while those that are correctly classified have 
their weights decreased. The intuition is that vectors which are difficult to classify 
receive more attention on subsequent iterations. 

The classifier learnt at each iteration is called a 'weak' classifier. It is called weak
because it is not expected to classify the training data very well, only better than
chance (that is, with an error rate below 50%). Typically a very simple weak classifier
is used. The final classifier, the so-called 'strong' classifier, is formed as a weighted
sum of the weak classifiers learnt at each step. Table 2 gives an algorithmic description
of the boosting classification procedure.

The formal guarantees provided by boosting classification theory are quite strong. 
Freund and Schapire prove that the training error of the strong classifier approaches 
zero exponentially in the number of iterations. 
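
For reference, the standard form of this guarantee can be written out as follows. We write $\epsilon_t$ for the weighted error of the weak learner chosen at iteration $t$ and introduce $\gamma_t = \tfrac{1}{2} - \epsilon_t$ (a symbol not used elsewhere in this paper) for its edge over chance; the error on the left is the training error measured under the initial weights.

```latex
\mathrm{err}_{\mathrm{train}}(h)
  \;\le\; \prod_{t=1}^{T} 2\sqrt{\epsilon_t (1 - \epsilon_t)}
  \;=\; \prod_{t=1}^{T} \sqrt{1 - 4\gamma_t^2}
  \;\le\; \exp\!\Big(-2 \sum_{t=1}^{T} \gamma_t^2\Big)
```

As an illustration with assumed numbers, if every weak learner had $\epsilon_t \le 0.45$ (so $\gamma_t \ge 0.05$), then after $T = 200$ iterations the bound would be $\exp(-2 \cdot 200 \cdot 0.05^2) = e^{-1} \approx 0.37$. The bound applies to training error only and is typically loose.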

3.1 Choice of Weak Learner 

The boosting algorithm does not impose any restriction on the nature of the weak 
learner. Any classifier that does a better job than pure chance is acceptable. In this 
paper we have experimented with a rather simple weak learner. We use a variant of 
AdaBoost [6] proposed by Tieu and Viola [15] in which the weak learner is a simple 
threshold that depends on a single component of the feature vector. This weak learner 
examines the feature vector and finds the component and threshold that best separates 
the two classes. Therefore each weak learner $h_j(x)$ is identified by a feature component
$f_j$, a threshold $\theta_j$, and a direction $d_j$ indicating the direction of the inequality
sign:

$$h_j(x) = \begin{cases} 1 & \text{if } d_j f_j(x) < d_j \theta_j \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

• Begin with $N$ training vectors $x_i$ and their associated labels $y_i$, where $y_i = 0, 1$ for
negative and positive examples respectively.

• Initialize weights $w_{1,i} = \frac{1}{2m}, \frac{1}{2l}$ for $y_i = 0, 1$ respectively, where $m$ and $l$ are the
number of negatives and positives respectively.

• For $t = 1, \ldots, T$:

1. Normalize the weights,

$$w_{t,i} \leftarrow \frac{w_{t,i}}{\sum_{j=1}^{N} w_{t,j}},$$

so that $w_t$ is a probability distribution and adds up to 1.0.

2. For each feature, $j$, train a classifier $h_j$ which is restricted to using a single
feature. The error is evaluated with respect to $w_t$: $\epsilon_j = \sum_i w_{t,i} \, |h_j(x_i) - y_i|$.

3. Choose the classifier, $h_t$, with the lowest error $\epsilon_t$.

4. Update the weights:

$$w_{t+1,i} = w_{t,i} \, \beta_t^{1 - e_i},$$

where $e_i = 0$ if example $x_i$ is classified correctly, $e_i = 1$ otherwise, and
$\beta_t = \frac{\epsilon_t}{1 - \epsilon_t}$.

• The final strong classifier is:

$$h(x) = \begin{cases} 1 & \text{if } \sum_{t=1}^{T} \alpha_t h_t(x) > \frac{1}{2} \sum_{t=1}^{T} \alpha_t \\ 0 & \text{otherwise} \end{cases}$$

where $\alpha_t = \log \frac{1}{\beta_t}$.

Table 2: The boosting algorithm for learning a classifier. T weak classifiers are
constructed. The final strong classifier is a weighted linear combination of the T weak
classifiers where the weights are inversely proportional to the training errors.



In practice no single feature component can perform the classification task with low 
error. Typically the first weak learner has an error rate of about 0.3 and the final learners 
closer to 0.5. Figure 2 shows weak and strong learner error rates for the HUB496 data 
set as a function of the number of iterations. We see that the strong error rate converges 
to 0.2 while the weak learner converges to around 0.5. 
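
The procedure of Table 2 with the single-feature threshold weak learner can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the code used for our experiments: the exhaustive threshold search, the clipping of the error away from 0 and 1, all function names, and the synthetic data are assumptions introduced here.

```python
import numpy as np

def stump_predict(X, f, theta, d):
    """Weak learner: output 1 when d * x[f] < d * theta, else 0."""
    return (d * X[:, f] < d * theta).astype(int)

def train_stump(X, y, w):
    """Exhaustively find the (feature, threshold, direction) with lowest weighted error."""
    best = (np.inf, 0, 0.0, 1)
    for f in range(X.shape[1]):
        values = np.unique(X[:, f])
        # Candidate thresholds: midpoints between sorted values, plus both ends.
        thetas = np.concatenate(([values[0] - 1.0],
                                 (values[:-1] + values[1:]) / 2.0,
                                 [values[-1] + 1.0]))
        for theta in thetas:
            for d in (1, -1):
                err = np.sum(w * np.abs(stump_predict(X, f, theta, d) - y))
                if err < best[0]:
                    best = (err, f, theta, d)
    return best

def boost(X, y, n_rounds):
    """Return a list of (alpha, feature, threshold, direction) weak learners."""
    n_neg, n_pos = np.sum(y == 0), np.sum(y == 1)
    w = np.where(y == 0, 1.0 / (2 * n_neg), 1.0 / (2 * n_pos))
    learners = []
    for _ in range(n_rounds):
        w = w / w.sum()                           # step 1: normalize
        err, f, theta, d = train_stump(X, y, w)   # steps 2 and 3: best stump
        err = np.clip(err, 1e-10, 1 - 1e-10)      # assumed guard against log(0)
        beta = err / (1.0 - err)
        e = (stump_predict(X, f, theta, d) != y).astype(int)
        w = w * beta ** (1 - e)                   # step 4: down-weight correct examples
        learners.append((np.log(1.0 / beta), f, theta, d))
    return learners

def strong_predict(X, learners):
    """Weighted vote of the weak learners against half the total weight."""
    total = sum(a for a, _, _, _ in learners)
    votes = sum(a * stump_predict(X, f, theta, d) for a, f, theta, d in learners)
    return (votes > 0.5 * total).astype(int)

# Synthetic demonstration data (hypothetical, unrelated to the HUB4 or SpeechBot sets).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 2] > 0).astype(int)
learners = boost(X, y, n_rounds=30)
train_error = np.mean(strong_predict(X, learners) != y)
```

The exhaustive search keeps the sketch short; a practical implementation would sort each feature once and update the weighted error incrementally.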
Figure 2: Strong and weak error rates as a function of the number of iterations in the
boosting algorithm. The dataset was a subset of the HUB4 confidence set. 



4 Alternative Classifiers 

In addition to boosting, we also experiment with alternative classifiers on our feature 
set. We use standard implementations of Support Vector Machines (SVMs) [2] and
Classification and Regression Trees (CART) [1].

5 Experimental Results 

We test our algorithm on confidence features obtained from two data sets. The first 
set is the 1996 HUB4 test set [3], a total of about 3 hours of speech. The second set 
is sampled from around 9 hours of transcribed Web Broadcast News from our internal 
SpeechBot test set [10]. 

To obtain lattices from which our confidence features are extracted, we run a standard
HMM-based decoder built on the HUB496 and HUB497 training sets. For the
SpeechBot test set, the training data is Real-Audio encoded and decoded to account for
the streamed nature of the test set. The decoder for the HUB4 data uses 16 Gaussian
mixture components per state. For the SpeechBot data, 8 mixture components are used.
The word error rates for the data sets are 32.9% and 55.0% respectively.

Using the decoded word lattices, we construct confidence feature vectors as described
in Section 2 for each word in the top hypothesis. Each feature vector is labeled with
'1' or '0', reflecting whether or not the word is correct. Table 3 gives further details of
the feature sets, including the baseline error or prior probability of Class 0. Notice that 
the error rates for the confidence vectors are not the same as the recognizer error rates. 
This is because deleted words, which count as errors for word error rate scores, do not 
appear in confidence feature sets (since there is no word to obtain features for). 



Data Set     Nr. Vectors   Baseline Error
--------     -----------   --------------
HUB496       43k           29.0%
SpeechBot    43k           41.7%

Table 3: Details of the HUB4 and SpeechBot confidence feature sets.



For all the experiments reported in this paper we perform 10-fold cross validation. Each
data set is randomized and split into 10 subsets. Training is performed on 9 subsets
and testing on the remaining one. This is repeated 10 times so that each subset is used
once for testing. Our error rates are therefore averages over all 10 subsets, which gives
a more reliable estimate than a single train/test split.
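
The cross-validation protocol can be sketched as follows. The function name, the `train_fn` and `predict_fn` callables, and the majority-class classifier used in the demonstration are hypothetical placeholders; any of the classifiers discussed in this paper could be plugged in instead.

```python
import numpy as np

def ten_fold_error(X, y, train_fn, predict_fn, n_folds=10, seed=0):
    """Average test error over n_folds random, disjoint folds."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))          # randomize the data set
    folds = np.array_split(order, n_folds)   # split into n_folds subsets
    errors = []
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        model = train_fn(X[train_idx], y[train_idx])
        predictions = predict_fn(model, X[test_idx])
        errors.append(np.mean(predictions != y[test_idx]))
    return float(np.mean(errors))

# Demonstration with a majority-class "classifier" on made-up labels.
X_demo = np.zeros((100, 1))
y_demo = np.array([1] * 71 + [0] * 29)
cv_error = ten_fold_error(
    X_demo, y_demo,
    train_fn=lambda X, y: int(np.mean(y) >= 0.5),         # remember the majority label
    predict_fn=lambda model, X: np.full(len(X), model),   # always predict it
)
```

With the majority-class placeholder, the cross-validated error equals the fraction of minority-class examples, which is the sense in which the prior probability of Class 0 serves as a baseline error.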

We also report the confidence error rates for both classes. Any classifier can be 
tuned to minimize global error rate or to minimize false positives or false negatives. In 
this paper we tune our classifiers to operate close to the equal error rate point where 
both false positives and false negatives are similar. Otherwise, our results will be biased 
by the prior probabilities of each class. 
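
One simple way to choose such an operating point is sketched below, assuming the classifier exposes a real-valued score per word (for boosting, the weighted vote sum could serve). The text does not describe the tuning procedure actually used, so this threshold sweep, the function name, and the toy scores are assumptions.

```python
import numpy as np

def equal_error_threshold(scores, labels):
    """Return (threshold, class-1 error, class-0 error) with the smallest error gap."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    best_gap, best = np.inf, None
    for thr in np.unique(scores):
        pred = (scores >= thr).astype(int)
        err1 = np.mean(pred[labels == 1] == 0)  # correct words rejected
        err0 = np.mean(pred[labels == 0] == 1)  # incorrect words accepted
        gap = abs(err1 - err0)
        if gap < best_gap:
            best_gap, best = gap, (thr, err1, err0)
    return best

# Toy scores and labels (hypothetical).
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 0, 1, 1]
thr, err1, err0 = equal_error_threshold(scores, labels)
```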

5.1 HUB496 results 

Table 4 shows the results of tests on the HUB4 dataset. We show error rates for boosting 
systems with up to 200 weak learners. We did not observe significant improvements 
beyond this number. The results show that we can reduce the error rate to 25.5%, an 
improvement of 12.1% relative to the baseline of 29.0%. 
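
For clarity, the relative improvement is the reduction in error rate expressed as a fraction of the baseline error rate:

```latex
\frac{29.0\% - 25.5\%}{29.0\%} = \frac{3.5}{29.0} \approx 0.121 = 12.1\%
```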



Number of        Class 1       Class 0       Total
weak learners    Error Rate    Error Rate    Error Rate
-------------    ----------    ----------    ----------
1                30.4%         28.9%         28.1%
50               27.6%         27.4%         26.1%
100              27.4%         27.1%         25.9%
200              28.0%         26.3%         25.5%

Table 4: Error rates for the HUB4 96 data set and their relationship to the number of
weak learners.

On this set, the CART classifier produces an error rate of 28.1%, almost no improvement
over the baseline. The SVM classifier yields an error rate of 31.2%, again
no improvement.

5.2 SpeechBot results 

Table 5 presents results for the SpeechBot dataset. Again, we show error rates for up to
200 weak learners. A substantial improvement over the baseline result is observed. We 
improve the error rate from 41.7% to 27.6%, a relative improvement of 33.8%. On this 
dataset, the CART classifier produces an error rate of 28.4% and the SVM classifier an 
error rate of 32.6%. 

6 Discussion 

We observe that our boosting classifier outperforms both SVM and CART classifiers.
Even on the HUB4 96 dataset, where the CART and SVM classifiers failed to yield any
improvement, boosting gave a 12.1% relative improvement.
Number of        Class 1       Class 0       Total
weak learners    Error Rate    Error Rate    Error Rate
-------------    ----------    ----------    ----------
1                28.6%         34.4%         31.9%
50               26.1%         29.7%         28.2%
100              25.5%         29.2%         27.7%
200              25.7%         28.9%         27.6%

Table 5: Error rates for the SpeechBot data set and their relationship to the number of
weak learners.




[Figure 3 plot not reproduced; horizontal axis: Weak Learner]



Figure 3: Weights applied to each of the weak learners. The first five learners contribute 
close to 25% to the decision. 



Because our choice of weak learner is a dimension-specific classifier, it is interesting
to examine when each feature component is chosen by the iterative boosting
procedure. Intuitively, confidence vector features that are chosen early are more
informative than those chosen later on. Using this simple analysis we observe that the
first six weak learners use feature components 3, 0, 3, 1, 7 and 11, with component 3
chosen twice. These components correspond to the average hypothesis density over the
word, the word graph probability, the hypothesis density at the word beginning, the
unigram score and the middle component of our 3D representation of the first phone in
the word. This order of feature choice is relatively consistent across experiments and
datasets. Interestingly, our 3D phone representation is more informative than many of
the other lattice-derived features.

Figure 3 displays typical weights applied to the first thirty weak learners. We observe
that the features for context words (components 14 to 39), which provide another
26 components of our 39-dimensional vector, only appear after position 10 or so.
In fact, after learning 100 weak learners only 26 of the 39 possible components have
been chosen. This is because our boosting implementation plays a dual role: learning
classifiers and picking the features that are most promising for classifying the data
correctly. This characteristic of our boosting implementation could be used as a
preprocessor to extract informative features from an arbitrarily large set to aid
dimensionality reduction in other tasks.
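
The kind of inspection described in this section can be sketched as follows, assuming the trained strong classifier is available as a list of (weight, feature index) pairs. The feature order is the one reported in the text; the weight values are hypothetical and serve only to make the example runnable.

```python
# (alpha, feature_index) for the first six weak learners.
# Feature order is taken from the text; alpha values are made up.
learners = [(0.9, 3), (0.6, 0), (0.5, 3), (0.4, 1), (0.35, 7), (0.25, 11)]

def feature_summary(learners):
    """Map each selected feature to its share of the total vote weight,
    keeping features in the order in which they were first chosen."""
    total = sum(alpha for alpha, _ in learners)
    share = {}
    for alpha, feature in learners:
        share[feature] = share.get(feature, 0.0) + alpha / total
    return share

summary = feature_summary(learners)
first_chosen = list(summary)       # distinct features in order of first selection
n_distinct = len(summary)          # how many of the components were ever used
```

Counting the distinct features in this way is what yields figures such as 26 of 39 components after 100 weak learners, and the unused components are natural candidates for removal.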
7 Conclusion

In this paper we have explored the use of boosting techniques for confidence scoring. 
We have compared them with two other classification schemes, CART and SVMs, 
and consistently outperformed them. Our choice of boosting algorithm offers several 
advantages. It is simple to implement, fast in its learning time, and very flexible in 
the choice of weak learner. In this paper we have used a very simple learner that 
picks individual features and classifies them with a threshold and a flag indicating the 
direction of the inequality sign. Remarkably, such a simple classifier is able to provide 
up to a 34% improvement in performance on the SpeechBot dataset. More sophisticated 
weak learners such as CART should be able to improve this performance at the cost of 
longer training time. 

In the future we will explore how confidence scores can be used to improve our
public audio indexing system, both to refine the retrieval function and for language
and acoustic model adaptation. Confidence scores will allow us to effectively
mine more than 9000 hours of unlabeled audio currently indexed by the SpeechBot
system.